Page 1 of 7

2018 Journal article Open Access

X-CLEaVER: Learning ranking ensembles by growing and pruning trees
Lucchese C., Nardini F. M., Orlando S., Perego R., Silvestri F., Trani S.
Learning-to-Rank (LtR) solutions are commonly used in large-scale information retrieval systems such as Web search engines, which have to return highly relevant documents in response to user query within fractions of seconds. The most effective LtR algorithms adopt a gradient boosting approach to build additive ensembles of weighted regression trees. Since the required ranking effectiveness is achieved with very large ensembles, the impact on response time and query throughput of these solutions is not negligible. In this article, we propose X-CLEaVER, an iterative meta-algorithm able to build more efcient and effective ranking ensembles. X-CLEaVER interleaves the iterations of a given gradient boosting learning algorithm with pruning and re-weighting phases. First, redundant trees are removed from the given ensemble, then the weights of the remaining trees are fne-tuned by optimizing the desired ranking quality metric. We propose and analyze several pruning strategies and we assess their benefts showing that interleaving pruning and re-weighting phases during learning is more effective than applying a single post-learning optimization step. Experiments conducted using two publicly available LtR datasets show that X-CLEaVER can be successfully exploited on top of several LtR algorithms as it is effective in optimizing the effectiveness of the learnt ensembles, thus obtaining more compact forests that hence are much more efcient at scoring time.Source: ACM transactions on intelligent systems and technology (Print) 9 (2018). doi:10.1145/3205453
DOI: 10.1145/3205453
DOI: 10.5281/zenodo.2668362
DOI: 10.5281/zenodo.2668361
Project(s): BigDataGrapes via OpenAIRE

, SoBigData via OpenAIRE

Metrics:

2018 Conference article Open Access

Selective gradient boosting for effective learning to rank
Lucchese C., Nardini F. M., Perego R., Orlando S., Trani S.
Learning an effective ranking function from a large number of query-document examples is a challenging task. Indeed, training sets where queries are associated with a few relevant documents and a large number of irrelevant ones are required to model real scenarios of Web search production systems, where a query can possibly retrieve thousands of matching documents, but only a few of them are actually relevant. In this paper, we propose Selective Gradient Boosting (SelGB), an algorithm addressing the Learning-to-Rank task by focusing on those irrelevant documents that are most likely to be mis-ranked, thus severely hindering the quality of the learned model. SelGB exploits a novel technique minimizing the mis-ranking risk, i.e., the probability that two randomly drawn instances are ranked incorrectly, within a gradient boosting process that iteratively generates an additive ensemble of decision trees. Specifically, at every iteration and on a per query basis, SelGB selectively chooses among the training instances a small sample of negative examples enhancing the discriminative power of the learned model. Reproducible and comprehensive experiments conducted on a publicly available dataset show that SelGB exploits the diversity and variety of the negative examples selected to train tree ensembles that outperform models generated by state-of-the-art algorithms by achieving improvements of NDCG@10 up to 3.2%.Source: International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 155–164, 8-12/07/2018
DOI: 10.1145/3209978.3210048
DOI: 10.5281/zenodo.2668014
DOI: 10.5281/zenodo.2668013
Project(s): BigDataGrapes via OpenAIRE

, MASTER

, SoBigData via OpenAIRE

Metrics:

2018 Conference article Open Access

Efficient and effective query expansion for web search
Lucchese C., Nardini F. M., Perego R., Trani R., Venturini R.
Query Expansion (QE) techniques expand the user queries with additional terms, e.g., synonyms and acronyms, to enhance the system recall. State-of-the-art solutions employ machine learning methods to select the most suitable terms. However, most of them neglect the cost of processing the expanded queries, thus selecting effective, yet very expensive, terms. The goal of this paper is to enable QE in scenarios with tight time constraints proposing a QE framework based on structured queries and efficiency-aware term selection strategies. In particular, the proposed expansion selection strategies aim at capturing the efficiency and the effectiveness of the expansion candidates, as well as the dependencies among them. We evaluate our proposals by conducting an extensive experimental assessment on real-world search engine data and public TREC data. Results confirm that our approach leads to a remarkable efficiency improvement w.r.t. the state-of-the-art: a reduction of the retrieval time up to 30 times, with only a small loss of effectiveness.Source: International Conference on Information and Knowledge Management (CIKM), pp. 1551–1554, 22-26/10/2018
DOI: 10.1145/3269206.3269305
DOI: 10.5281/zenodo.2668248
DOI: 10.5281/zenodo.2668249
Project(s): BigDataGrapes via OpenAIRE

Metrics:

2018 Journal article Open Access

SEL: a unified algorithm for salient entity linking
Trani S., Lucchese C., Perego R., Losada D. E., Ceccarelli D., Orlando S.
The entity linking task consists in automatically identifying and linking the entities mentioned in a text to their uniform resource identifiers in a given knowledge base. This task is very challenging due to its natural language ambiguity. However, not all the entities mentioned in the document have the same utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in the document, also known as salient entities (SE), is attracting increasing interest. In this paper, we propose salient entity linking, a novel supervised 2-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is aimed at identifying a set of candidate entities that are likely to be mentioned in the document. The second step, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on 2 different data sets show that the proposed algorithm outperforms state-of-the-art competitors and is able to detect SE with high accuracy. Furthermore, we used salient entity linking for extractive text summarization. We found that entity saliency can be incorporated into text summarizers to extract salient sentences from text. The resulting summarizers outperform well-known summarization systems, proving the importance of using the SE information.Source: Computational intelligence 34 (2018): 2–29. doi:10.1111/coin.12147
DOI: 10.1111/coin.12147
Project(s): SoBigData via OpenAIRE

Metrics:

See at: ISTI Repository Open Access | Computational Intelligence Restricted | onlinelibrary.wiley.com | CNR ExploRA

2017 Conference article Restricted

SELEcTor: Discovering Similar Entities on LinkEd DaTa by Ranking Their Features
Ruback L., Casanova M. A., Renso C., Lucchese C.
Several approaches have been used in the last years to compute similarity between entities. In this paper, we present a novel approach to compute similarity between entities using their features available as Linked Data. The key idea of the proposed framework, called SELEcTor, is to exploit ranked lists of features extracted from Linked Data sources as a representation of the entities we want to compare. The similarity between two entities is thus mapped to the problem of comparing two ranked lists. Our experiments, conducted with museum data from DBpedia, demonstrate that SELEcTor achieves better accuracy than state- of-the-art methods.Source: ICSC 2017 - IEEE 11th International Conference on Semantic Computing, pp. 117–124, San Diego, CA, USA, 30 January-2 February 2017
DOI: 10.1109/icsc.2017.46
Metrics:

See at: doi.org Restricted | ieeexplore.ieee.org | CNR ExploRA

2017 Journal article Open Access

Fast connected components computation in large graphs by vertex pruning
Lulli A., Carlini E., Dazzi P., Lucchese C., Ricci L.
Finding connected components is a fundamental task in applications dealing with graph analytics, such as social network analysis, web graph mining and image processing. The exponentially growing size of today's graphs has required the definition of new computational models and algorithms for their efficient processing on highly distributed architectures. In this paper we present CRACKER, an efficient iterative MapReduce-like algorithm to detect connected components in large graphs. The strategy of CRACKER is to transform the input graph in a set of trees, one for each connected component in the graph. Nodes are iteratively removed from the graph and added to the trees, reducing the amount of computation at each iteration. We prove the correctness of the algorithm, evaluate its computational cost and provide an extensive experimental evaluation considering a wide variety of synthetic and real-world graphs. The experimental results show that CRACKER consistently outperforms state-of-the-art approaches both in terms of total computation time and volume of messages exchanged.Source: IEEE transactions on parallel and distributed systems (Print) 28 (2017): 760–773. doi:10.1109/TPDS.2016.2591038
DOI: 10.1109/tpds.2016.2591038
Metrics:

See at: Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari Open Access | ISTI Repository | IEEE Transactions on Parallel and Distributed Systems Restricted | ieeexplore.ieee.org | CNR ExploRA

2017 Conference article Restricted

On including the user dynamic in learning to rank
Ferro N., Lucchese C., Maistro M., Perego R.
Ranking query results effectively by considering user past behaviour and preferences is a primary concern for IR researchers both in academia and industry. In this context, LtR is widely believed to be the most effective solution to design ranking models that account for user-interaction features that have proved to remarkably impact on IR effectiveness. In this paper, we explore the possibility of integrating the user dynamic directly into the LtR algorithms. Specifically, we model with Markov chains the behaviour of users in scanning a ranked result list and we modify Lambdamart, a state-of-the-art LtR algorithm, to exploit a new discount loss function calibrated on the proposed Markovian model of user dynamic. We evaluate the performance of the proposed approach on publicly available LtR datasets, finding that the improvements measured over the standard algorithm are statistically significant.Source: SIGIR '17 - 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1041–1044, Shinjuku, Tokyo, Japan, 7 - 11 August, 2017
DOI: 10.1145/3077136.3080714
Metrics:

See at: doi.acm.org Restricted | doi.org | CNR ExploRA

2017 Conference article Restricted

RankEval: an evaluation and analysis framework for learning-to-rank solutions
Lucchese C., Muntean C. I., Nardini F. M., Perego R., Trani S.
In this demo paper we propose RankEval, an open-source tool for the analysis and evaluation of Learning-to-Rank (LtR) models based on ensembles of regression trees. Gradient Boosted Regression Trees (GBRT) is a flexible statistical learning technique for classification and regression at the state of the art for training effective LtR solutions. Indeed, the success of GBRT fostered the development of several open-source LtR libraries targeting efficiency of the learning phase and effectiveness of the resulting models. However, these libraries offer only very limited help for the tuning and evaluation of the trained models. In addition, the implementations provided for even the most traditional IR evaluation metrics differ from library to library, thus making the objective evaluation and comparison between trained models a difficult task. RankEval addresses these issues by providing a common ground for LtR libraries that offers useful and interoperable tools for a comprehensive comparison and in-depth analysis of ranking models.Source: SIGIR '17 - 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1281–1284, Tokyo, Japan, 9-11 August 2017
DOI: 10.1145/3077136.3084140
Project(s): SoBigData via OpenAIRE

Metrics:

See at: dl.acm.org Restricted | doi.org | CNR ExploRA

2017 Conference article Restricted

X-DART: blending dropout and pruning for efficient learning to rank
Lucchese C., Nardini F. M., Orlando S., Perego R., Trani S.
In this paper we propose X-DART, a new Learning to Rank algorithm focusing on the training of robust and compact ranking models. Motivated from the observation that the last trees of MART models impact the prediction of only a few instances of the training set, we borrow from the DART algorithm the dropout strategy consisting in temporarily dropping some of the trees from the ensemble while new weak learners are trained. However, differently from this algorithm we drop permanently these trees on the basis of smart choices driven by accuracy measured on the validation set. Experiments conducted on publicly available datasets shows that X-DART outperforms DART in training models providing the same effectiveness by employing up to 40% less trees.Source: SIGIR '17 - 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1077–1080, Tokyo, Japan, 9-11 August 2017
DOI: 10.1145/3077136.3080725
Metrics:

See at: doi.acm.org Restricted | doi.org | CNR ExploRA

2017 Conference article Restricted

Multicore/Manycore parallel traversal of large forests of regression trees
Lettich F., Lucchese C., Nardini F. M., Orlando S., Perego R., Tonellotto N., Venturini R.
Machine-learnt models based on additive ensembles of binary regression trees are currently considered one of the best solutions to address complex classification, regression, and ranking tasks. To evaluate these complex models over a continuous stream of data items with high throughput requirements, we need to optimize, and possibly parallelize, the traversal of thousands of trees, each including hundreds of nodes.Document ranking in Web Search is a typical example of this challenging scenario, where complex tree-based models are used to score query-document pairs and finally rank lists of document results for each incoming query (a.k.a. Learning-to-Rank). In this extended abstract, we briefly discuss some preliminary results concerning the parallelization strategies for QUICKSCORER - indeed the state-of-art scoring algorithm that exploits ensembles of decision trees - by using multicore CPUs (with SIMD coprocessors) and manycore GPUs. We show that QUICKSCORER, which transforms the traversal of thousands of decision trees in a linear access to array data structures, can be parallelized very effectively, by achieving very interesting speedups.Source: HPCS 2017 - International Conference on High Performance Computing & Simulation, pp. 915–915, Genoa, Italy, 17-21 July 2017
DOI: 10.1109/hpcs.2017.154
Metrics:

See at: doi.org Restricted | ieeexplore.ieee.org | CNR ExploRA

2017 Conference article Restricted

LEARning Next gEneration Rankers (LEARNER 2017)
Ferro N., Lucchese C., Maistro M., Perego R.
The aim of LEARNER@ICTIR2017 is to investigate new solutions for LtR. In details, we identify some research areas related to LtR which are of actual interest and which have not been fully explored yet. We solicit the submission of position papers on novel LtR algorithms, on evaluation of LtR algorithms, on dataset creation and curation, and on domain specific applications of LtR. LEARNER@ICTIR2017 will be a gathering of academic people interested in IR, ML and related application areas. We believe that the proposed workshop is relevant to ICTIR since we look for novel contributions to LtR focused on foundational and conceptual aspects, which need to be properly framed and modeled.Source: ICTIR '17 - ACM SIGIR International Conference on Theory of Information Retrieval, pp. 331–332, Amsterdam, The Netherlands, 1-4 October, 2017
DOI: 10.1145/3121050.3121110
Project(s): SoBigData via OpenAIRE

Metrics:

See at: doi.acm.org Restricted | doi.org | CNR ExploRA

2017 Conference article Restricted

Efficiency/Effectiveness trade-offs in learning to rank
Lucchese C., Nardini F. M.
In the last years, Learning to Rank (LtR) had a significant influence on several tasks in the Information Retrieval field, with large research efforts coming both from the academia and the industry. Indeed, efficiency requirements must be fulfilled in order to make an effective research product deployable within an industrial environment. The evaluation of a model can be too expensive due to its size, the features used and several other factors. This tutorial discusses the recent solutions that allow to build an effective ranking model that satisfies temporal budget constrains at evaluation time.Source: ICTIR 2017 - 3rd ACM International Conference on the Theory of Information Retrieval, pp. 329–330, Amsterdam, The Netherlands, 1-4 October, 2017
DOI: 10.1145/3121050.3121109
Metrics:

See at: doi.acm.org Restricted | doi.org | CNR ExploRA

2017 Journal article Open Access

Perception of social phenomena through the multidimensional analysis of online social networks
Coletto M., Esuli A., Lucchese C., Muntean C. I., Nardini F. M., Perego R., Renso C.
We propose an analytical framework aimed at investigating different views of the discussions regarding polarized topics which occur in Online Social Networks (OSNs). The framework supports the analysis along multiple dimensions, i.e., time, space and sentiment of the opposite views about a controversial topic emerging in an OSN. To assess its usefulness in mining insights about social phenomena, we apply it to two different Twitter case studies: the discussions about the refugee crisis and the United Kingdom European Union membership referendum. These complex and contended topics are very important issues for EU citizens and stimulated a multitude of Twitter users to take side and actively participate in the discussions. Our framework allows to monitor in a scalable way the raw stream of relevant tweets and to automatically enrich them with location information (user and mentioned locations), and sentiment polarity (positive vs. negative). The analyses we conducted show how the framework captures the differences in positive and negative user sentiment over time and space. The resulting knowledge can support the understanding of complex dynamics by identifying variations in the perception of specific events and locations.Source: Online social networks and media 1 (2017): 14–32. doi:10.1016/j.osnem.2017.03.001
DOI: 10.1016/j.osnem.2017.03.001
Project(s): SoBigData via OpenAIRE

Metrics:

See at: ISTI Repository Open Access | Online Social Networks and Media Restricted | www.sciencedirect.com | CNR ExploRA

2017 Conference article Open Access

QuickScorer: efficient traversal of large ensembles of decision trees
Lucchese C., Nardini F. M., Orlando S., Perego R., Tonellotto N., Venturini R.
Machine-learnt models based on additive ensembles of binary regression trees are currently deemed the best solution to address complex classification, regression, and ranking tasks. Evaluating these models is a computationally demanding task as it needs to traverse thousands of trees with hundreds of nodes each. The cost of traversing such large forests of trees significantly impacts their application to big and stream input data, when the time budget available for each prediction is limited to guarantee a given processing throughput. Document ranking in Web search is a typical example of this challenging scenario, where the exploitation of tree-based models to score query-document pairs, and finally rank lists of documents for each incoming query, is the state-of-art method for ranking (a.k.a. Learning-to-Rank). This paper presents QuickScorer, a novel algorithm for the traversal of huge decision trees ensembles that, thanks to a cache- and CPU-aware design, provides a 9 speedup over best competitors.Source: ECML PKDD - Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 383–387, Skopje, Macedonia, 18-22 September, 2017
DOI: 10.1007/978-3-319-71273-4_36
Metrics:

See at: arpi.unipi.it Open Access | doi.org Restricted | link.springer.com | CNR ExploRA

2017 Conference article Restricted

Sentiment spreading: an epidemic model for lexicon-based sentiment analysis on Twitter
Pollacci L., Sirbu A., Giannotti F., Pedreschi D., Lucchese C., Muntean C. I.
While sentiment analysis has received significant attention in the last years, problems still exist when tools need to be applied to microblogging content. This because, typically, the text to be analysed consists of very short messages lacking in structure and semantic context. At the same time, the amount of text produced by online platforms is enormous. So, one needs simple, fast and effective methods in order to be able to efficiently study sentiment in these data. Lexicon-based methods, which use a predefined dictionary of terms tagged with sentiment valences to evaluate sentiment in longer sentences, can be a valid approach. Here we present a method based on epidemic spreading to automatically extend the dictionary used in lexicon-based sentiment analysis, starting from a reduced dictionary and large amounts of Twitter data. The resulting dictionary is shown to contain valences that correlate well with human-annotated sentiment, and to produce tweet sentiment classifications comparable to the original dictionary, with the advantage of being able to tag more tweets than the original. The method is easily extensible to various languages and applicable to large amounts of data.Source: AI*IA Conference of the Italian Association for Artificial Intelligence, pp. 114–127, Bari, Italy, 14-17 November 2017
DOI: 10.1007/978-3-319-70169-1_9
Project(s): SoBigData via OpenAIRE

Metrics:

See at: Lecture Notes in Computer Science Restricted | Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari | link.springer.com | CNR ExploRA

2016 Conference article Open Access

On the behaviour of deviant communities in online social networks
Coletto M., Aiello L. M., Lucchese C., Silvestri F.
On-line social networks are complex ensembles of inter-linked communities that interact on different topics. Some communities are characterized by what are usually referred to as deviant behaviours, conducts that are commonly considered inappropriate with respect to the society's norms or moral standards. Eating disorders, drug use, and adult content consumption are just a few examples. We refer to such communities as deviant networks. It is commonly believed that such deviant networks are niche, isolated social groups, whose activity is well separated from the mainstream social media life. According to this assumption, research studies have mostly considered them in isolation. In this work we focused on adult content consumption networks, which are present in many on-line social media and in the Web in general. We found that few small and densely connected communities are responsible for most of the content production. Differently from previous work, we studied how such communities interact with the whole social network. We found that the produced content flows to the rest of the network mostly directly or through bridge-communities, reaching at least 450 times more users.We also show that a large fraction of the users can be inadvertently exposed to such content through indirect content resharing. We also discuss a demographic analysis of the producers and consumers networks. Finally, we show that it is easily possible to identify a few core users to radically uproot the diffusion process. We aim at setting the basis to study deviant communities in context.Source: ICWSM 2016 - Tenth International AAAI Conference on Web and Social Media, pp. 72, Cologne, Germany, 17-20 May 2016

See at: www.aaai.org Open Access | CNR ExploRA

2016 Conference article Restricted

Post-learning optimization of tree ensembles for efficient ranking
Lucchese C., Perego R., Nardini F. M., Silvestri F., Orlando S., Trani S.
Learning to Rank (LtR) is the machine learning method of choice for producing high quality document ranking functions from a ground-truth of training examples. In practice, efficiency and effectiveness are intertwined concepts and trading off effectiveness for meeting efficiency constraints typically existing in large-scale systems is one of the most urgent issues. In this paper we propose a new framework, named CLEaVER, for optimizing machine-learned ranking models based on ensembles of regression trees. The goal is to improve efficiency at document scoring time without affecting quality. Since the cost of an ensemble is linear in its size, CLEaVER first removes a subset of the trees in the ensemble, and then fine-tunes the weights of the remaining trees according to any given quality measure. Experiments conducted on two publicly available LtR datasets show that CLEaVER is able to prune up to 80% of the trees and provides an efficiency speed-up up to 2.6x without affecting the effectiveness of the model.Source: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 949–952, Pisa, Italy, 17-21 July 2016
DOI: 10.1145/2911451.2914763
Project(s): SoBigData via OpenAIRE

Metrics:

See at: dl.acm.org Restricted | doi.org | CNR ExploRA

2016 Conference article Restricted

SEL: A unified algorithm for entity linking and saliency detection
Trani S., Ceccarelli D., Lucchese C., Orlando S., Perego R.
The Entity Linking task consists in automatically identifying and linking the entities mentioned in a text to their URIs in a given Knowledge Base, e.g., Wikipedia. Entity Linking has a large impact in several text analysis and information retrieval related tasks. This task is very challenging due to natural language ambiguity. However, not all the entities mentioned in a document have the same relevance and utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in a document, also known as Salient Entities, is attracting increasing interest. In this paper we propose SEL, a novel supervised two-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is based on a classifier aimed at identifying a set of candidate entities that are likely to be mentioned in the document, thus maximizing the precision of the method without hindering its recall. The second step is still based on machine learning, and aims at choosing from the previous set the entities that actually occur in the document. Indeed, we tested two different versions of the second step, one aimed at solving only the entity linking task, and the other that, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on two different datasets show that the proposed algorithm outperforms state-of-the-art competitors, and is able to detect salient entities with high accuracy.Source: ACM Symposium on Document Engineering, pp. 85–94, Vienna, Austria, 13-16 September 2016
DOI: 10.1145/2960811.2960819
Project(s): SoBigData via OpenAIRE

Metrics:

See at: dl.acm.org Restricted | doi.org | CNR ExploRA

2016 Contribution to conference Restricted

Improve ranking efficiency by optimizing tree ensembles
Lucchese C., Nardini F. M., Orlando S., Perego R., Silvestri F., Trani S.
Learning to Rank (LtR) is the machine learning method of choice for producing highly effective ranking functions. However, efficiency and effectiveness are two competing forces and trading off effiectiveness for meeting efficiency constraints typical of production systems is one of the most urgent issues. This extended abstract shortly summarizes the work in [4] proposing CLEaVER, a new framework for optimizing LtR models based on ensembles of regression trees. We summarize the results of a comprehensive evaluation showing that CLEaVER is able to prune up to 80% of the trees and provides an efficiency speed-up up to 2:6x without affecting the effectiveness of the model.Source: 7th Italian Information Retrieval Workshop, Venezia, Italia, 30-31 May 2016

See at: ceur-ws.org Restricted | CNR ExploRA

2016 Journal article Open Access

Quality versus efficiency in document scoring with learning-to-rank models
Capannini G., Lucchese C., Nardini F. M., Orlando S., Perego R., Tonellotto N.
Learning-to-Rank (LtR) techniques leverage machine learning algorithms and large amounts of training data to induce high-quality ranking functions. Given a set of documents and a user query, these functions are able to precisely predict a score for each of the documents, in turn exploited to effectively rank them. Although the scoring efficiency of LtR models is critical in several applications - e.g., it directly impacts on response time and throughput of Web query processing - it has received relatively little attention so far. The goal of this work is to experimentally investigate the scoring efficiency of LtR models along with their ranking quality. Specifically, we show that machine-learned ranking models exhibit a quality versus efficiency trade-off. For example, each family of LtR algorithms has tuning parameters that can influence both effectiveness and efficiency, where higher ranking quality is generally obtained with more complex and expensive models. Moreover, LtR algorithms that learn complex models, such as those based on forests of regression trees, are generally more expensive and more effective than other algorithms that induce simpler models like linear combination of features. We extensively analyze the quality versus efficiency trade-off of a wide spectrum of state-of-the-art LtR, and we propose a sound methodology to devise the most effective ranker given a time budget. To guarantee reproducibility, we used publicly available datasets and we contribute an open source C++ framework providing optimized, multi-threaded implementations of the most effective tree-based learners: Gradient Boosted Regression Trees (GBRT), Lambda-Mart (Î»-MART), and the first public-domain implementation of Oblivious Lambda-Mart (Î©Î»-MART), an algorithm that induces forests of oblivious regression trees. We investigate how the different training parameters impact on the quality versus efficiency trade-off, and provide a thorough comparison of several algorithms in the quality-cost space. The experiments conducted show that there is not an overall best algorithm, but the optimal choice depends on the time budget.Source: Information processing & management 52 (2016): 1161–1177. doi:10.1016/j.ipm.2016.05.004
DOI: 10.1016/j.ipm.2016.05.004
Project(s): SoBigData via OpenAIRE

Metrics: